
[SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer #42908

Closed

Conversation

@juliuszsompolski
Contributor

What changes were proposed in this pull request?

Deflake tests in ReattachableExecuteSuite and increase CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE.

Why are the changes needed?

Two tests could be flaky, failing with INVALID_CURSOR.POSITION_NOT_AVAILABLE errors.
This happens when the server releases a response once it falls more than CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE behind the latest response it sent. However, because of HTTP/2 flow control, the released responses could still be in transit to the client. In the test suite, we were explicitly disconnecting the iterators and reconnecting later; in some cases they could not reconnect, because the response they had last seen had fallen too far behind.

This change not only adjusts the suite, but also the default config, which potentially makes reconnecting more robust. In normal situations it should not lead to increased memory pressure, because clients also release responses using ReleaseExecute as soon as they are received. Buffered responses should normally be freed by ReleaseExecute, and this retry buffer is only a fallback mechanism. Therefore, it is safe to increase the default.

In practice, this only has an effect when there are actual network errors, and the increased buffer size should make reconnects more robust in those cases.
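For anyone tuning this further, here is a minimal sketch of overriding the buffer size for a local session backing a Connect server. The config key used below is an assumption inferred from the constant name CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE; check org.apache.spark.sql.connect.config.Connect for the authoritative key.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the config key is assumed from the constant name,
// not verified against Connect.scala.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.connect.execute.reattachable.observerRetryBufferSize", "10m")
  .getOrCreate()
```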

Does this PR introduce any user-facing change?

No.

How was this patch tested?

ReattachableExecuteSuite.
Also did more manual experiments on how far the last response seen by the client can lag behind the latest response sent by the server (because of the HTTP/2 flow control window).

Was this patch authored or co-authored using generative AI tooling?

No.

@juliuszsompolski
Contributor Author

cc @hvanhovell @dongjoon-hyun

```diff
@@ -139,7 +139,7 @@ object Connect {
       "With any value greater than 0, the last sent response will always be buffered.")
     .version("3.5.0")
     .bytesConf(ByteUnit.BYTE)
-    .createWithDefaultString("1m")
+    .createWithDefaultString("10m")
```
Member

Since the purpose is to deflake ReattachableExecuteSuite and increase the retry buffer, shall we increase this only in ReattachableExecuteSuite instead of touching the default value?

Contributor Author

I explained in the PR description why I think increasing this default is a genuine improvement: it will make reconnects more robust in case of actual network issues, while not increasing memory pressure in the normal scenario (where the client controls the flow with ReleaseExecute).
Since Spark 3.5 was released before this suite was added, making the change now is low risk, and it will have good baking-in time before the next release.

Contributor Author

I would rather not change it only in the suite, because the suite is supposed to stress test how the actual client behaves when faced with disconnects. Overriding it there would sweep under the carpet what the recent experiments suggest: that the default was a bit too small for retry robustness.
This is not a major issue; in practice it only matters when Connect faces real intermittent connectivity issues, where before this mechanism was implemented it would simply fail.

```diff
-class SparkConnectServerTest extends SharedSparkSession {
+trait SparkConnectServerTest extends SharedSparkSession {
```
@juliuszsompolski (Contributor Author) commented Sep 14, 2023

Having it as a class was making it execute as a suite with no tests, while still running the beforeAll / afterAll.
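For context, a minimal sketch of the difference under ScalaTest (names here are illustrative, not the actual Spark classes): a concrete class extending a suite base gets discovered and run as an empty suite, so beforeAll / afterAll still fire; a trait only ever runs through the suites that mix it in.

```scala
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Discovered by ScalaTest as a runnable suite with zero tests; beforeAll and
// afterAll still execute, starting and stopping a server for nothing.
class ServerTestBaseAsClass extends AnyFunSuite with BeforeAndAfterAll {
  override def beforeAll(): Unit = { /* start test server */ }
  override def afterAll(): Unit = { /* stop test server */ }
}

// Not instantiable on its own, so never discovered directly; the setup and
// teardown run only for suites that mix it in.
trait ServerTestBase extends AnyFunSuite with BeforeAndAfterAll {
  override def beforeAll(): Unit = { /* start test server */ }
  override def afterAll(): Unit = { /* stop test server */ }
}

class ExampleSuite extends ServerTestBase {
  test("uses the shared server") { /* ... */ }
}
```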

@dongjoon-hyun
Member

It seems that another suite has started to fail. Is it related?

[Screenshot: SparkConnectSessionHolderSuite failures in CI, 2023-09-14]

@LuciferYang
Contributor

Just to confirm, will the case mentioned in #42560 (comment) also be fixed by this PR?

@juliuszsompolski
Contributor Author

@dongjoon-hyun I don't think the SparkConnectSessionHolderSuite failures are related, and I don't know what's going on there.

Streaming foreachBatch worker is starting with url sc://localhost:15002/;user_id=testUser and sessionId 9863bb98-6682-43ad-bc86-b32d8486fb47.
Traceback (most recent call last):
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/foreach_batch_worker.py", line 86, in <module>
    main(sock_file, sock_file)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/foreach_batch_worker.py", line 51, in main
    spark_connect_session = SparkSession.builder.remote(connect_url).getOrCreate()
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/session.py", line 464, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 34, in require_minimum_pandas_version
    raise ImportError(
ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.
[info] - python foreachBatch process: process terminates after query is stopped *** FAILED *** (1 second, 115 milliseconds)

Streaming query listener worker is starting with url sc://localhost:15002/;user_id=testUser and sessionId ab6cfcde-a9f1-4b96-8ca3-7aab5c6ff438.
Traceback (most recent call last):
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/listener_worker.py", line 99, in <module>
    main(sock_file, sock_file)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/listener_worker.py", line 59, in main
    spark_connect_session = SparkSession.builder.remote(connect_url).getOrCreate()
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/session.py", line 464, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__)
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 34, in require_minimum_pandas_version
    raise ImportError(
ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.
[info] - python listener process: process terminates after listener is removed *** FAILED *** (434 milliseconds)
[info]   java.io.EOFException:

It looks to me like some (intermittent?) environment issue.

@juliuszsompolski
Contributor Author

@LuciferYang I tried looking at #42560 (comment) but did not reproduce it yet. If you have more instances of CI runs where it failed with that stack overflow, that could be useful.
Inspecting the code, I don't see how that iterator could get looped like that...

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @juliuszsompolski and all.
Given the current situation, I believe we can proceed further after merging this.

@dongjoon-hyun
Member

Merged to master.

@LuciferYang
Contributor

LuciferYang commented Sep 18, 2023

> @LuciferYang I tried looking at #42560 (comment) but did not reproduce it yet. If you have more instances of CI runs where it failed with that stack overflow, that could be useful. Inspecting the code, I don't see how that iterator could get looped like that...

It seems that this issue is relatively easy to reproduce on GitHub Actions. In the past three days, the daily tests with Scala 2.13 have all experienced a StackOverflowError.

@LuciferYang
Contributor

```
dev/change-scala-version.sh 2.13
build/sbt "connect/test" -Pscala-2.13
```

@juliuszsompolski When I run the above commands in a local test, it is easier to reproduce the StackOverflowError.

@LuciferYang
Contributor

> @LuciferYang I tried looking at #42560 (comment) but did not reproduce it yet. If you have more instances of CI runs where it failed with that stack overflow, that could be useful. Inspecting the code, I don't see how that iterator could get looped like that...

Although it seems quite magical: after #42981 eliminated an ambiguous reference in org.apache.spark.sql.connect.client.CloseableIterator, the test `abandoned query gets INVALID_HANDLE.OPERATION_ABANDONED error` no longer hits java.lang.StackOverflowError in my local tests. Let's monitor the GA tests with Scala 2.13.
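As a minimal illustration of the failure mode (not the actual Spark Connect code): if a method in an iterator wrapper is meant to delegate to the wrapped iterator but the bare name resolves to the wrapper's own member, every call re-enters itself until the stack overflows.

```scala
trait CloseableIterator[A] extends Iterator[A] {
  def close(): Unit
}

class WrappingIterator[A](inner: Iterator[A]) extends CloseableIterator[A] {
  // BUG: intended to delegate to inner.hasNext, but the bare name resolves
  // to this wrapper's own hasNext, so every call recurses into itself and
  // eventually throws java.lang.StackOverflowError.
  override def hasNext: Boolean = hasNext
  override def next(): A = inner.next()
  override def close(): Unit = ()
}
```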

@LuciferYang
Contributor

LuciferYang commented Sep 19, 2023

@dongjoon-hyun @juliuszsompolski I think branch-3.5 also needs this PR, since #42560 has also been merged into branch-3.5.

@hvanhovell
Contributor

Let me backport it to 3.5.

hvanhovell pushed a commit that referenced this pull request Sep 19, 2023
… increase retry buffer

[Commit message duplicates the PR description above.]

Closes #42908 from juliuszsompolski/SPARK-44872-followup.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@hvanhovell
Contributor

I have cherry-picked this to 3.5.

@LuciferYang
Contributor

Thanks @hvanhovell
